(54) Apparatus and methods for sharing idle workstations



(57) The present invention relates to systems for 
sharing idle workstation computers that are connected 
together through a network and shared file system. 
More particularly, a user of a local host workstation may 
submit jobs for execution on remote workstations. The
systems of the present invention select a remote host
that is idle in accordance with a decentralized scheduling
scheme and then continuously monitor the activity
of the remote host on which the job is executing. If the
system detects certain activity on the remote host by
one of the remote host's primary users, the execution of
the job is immediately suspended to prevent inconvenience
to the primary users. The system also suspends
job execution if the remote host's load average gets too
high. Either way, the suspended job is migrated by selecting
another idle remote workstation to resume execution
of the suspended job (from the point in time at
which the last checkpoint occurred).
Description

Cross Reference to Related Applications

The present invention is related to the following
International Patent Applications: "Persistent State
Checkpoint And Restoration Systems," PCT Patent Ap- 
plication No. PCT/US95/07629 and "Checkpoint And 
Restoration Systems For Execution Control," PCT Pat- 
ent Application No. PCT/US95/07660, both of which are 
assigned to the assignee of the present invention and 
both of which are incorporated herein by reference. Ad- 
ditionally, the following articles are also incorporated 
herein by reference: Y.M. Wang et al., "Checkpointing
and Its Applications," Symposium on Fault-Tolerant
Computing, 1995, and G.S. Fowler, "The Shell as a
Service," USENIX Conference Proceedings, June



Background of The Invention 

The present invention relates to networked computer
systems. In particular, the present invention relates
to computing environments that are populated with net- 
worked workstations sharing network file systems. One 
known system for sharing idle resources on networked 
computer systems is the Condor system that was devel- 
oped at the University of Wisconsin (idle resources refer, 
generally, to a networked workstation having no user in- 
put commands for at least a certain period of time). The 
Condor system is more fully described in M. Litzkow et
al., "Condor - a hunter of idle workstations," Proc. ICDCS,
pp. 104-111, 1988.

Some of the main disadvantages of the Condor system
are as follows. Condor uses a centralized scheduler
(on a centralized server) to allocate network resources,
which is a potential security risk. For example, having a
centralized server requires that the server have root
privileges in order for it to create processes that impersonate
the individual users who submit jobs to it for remote
execution. Any lack of correctness in the coding
of the centralized server (e.g., any "bugs") may allow
someone other than the authorized user to gain access
to other users' privileges. Secondly, the Condor system
migrates the execution of a job on a remote host as soon
as it detects any mouse or keyboard activity (i.e., user
inputs) on the remote host. Therefore, if any user, including
users other than the primary user, begins using
the remote host, shared execution is terminated and the
task is migrated. This causes needless migration and
its concomitant loss of work. Thirdly, due to the nature
of a centralized server, starting the server can only be
accomplished by someone with root privileges.

Another known system for accomplishing resource
sharing is the Coshell system (which is described more
fully in the G.S. Fowler article "The Shell as a Service,"
incorporated by reference above). Coshell, unfortunately,
also has disadvantages, such as the fact that Coshell
also suspends a job on the remote host whenever the
remote host has any mouse or keyboard activity. Additionally,
Coshell cannot migrate a suspended job (a job
that was executing remotely on a workstation and was
suspended upon a user input at the remote workstation)
to another machine, but has to wait until the mouse or
keyboard activity ends on the remote host before resuming
execution of the job. The fact that Coshell suspends
the job's execution in response to any mouse or keyboard
activity creates needless suspensions of the job
when any mouse or keyboard activity is sensed at the
remote host.

It would therefore be desirable to provide systems
and methods of more efficiently sharing computational
resources between networked workstations.

It would also be desirable to provide program execution
on idle, remote workstations in which suspension
of the program execution is reduced to further increase
processing efficiency.

It would be still further desirable to provide a network
resource sharing architecture in which suspended
jobs may be efficiently migrated to alternate idle workstations
when the initial idle, remote workstation is once
again in use by the primary user.

Summary of the Invention

The above and other objects of the invention are
accomplished by providing methods for increasing the
efficiency in sharing computational resources in a network
architecture having multiple workstations in which
at least one of those workstations is idle. The present
invention provides routines that, once a job is executing
remotely, do not suspend that job unless the primary user
"retakes possession" of the workstation (versus any
user other than the primary user). Additionally, the
present invention provides the capability to migrate a remote
job, once it has been suspended, to another idle
workstation in an efficient manner, rather than requiring
that the job remain on the remote workstation in a suspended
state until the remote workstation becomes idle.



Brief Description of the Drawings

The above and other objects of the present invention
will be apparent upon consideration of the following
detailed description, taken in conjunction with the accompanying
drawings, in which like reference characters
refer to like parts throughout, and in which:

FIG. 1 is an illustrative schematic diagram that depicts
substantially all of the processes that are typically
active when a single user, on the local host,
has a single job running on a remote host in accordance
with the principles of the present invention;
and

FIG. 2 is a table that depicts a representative sample
of an attributes file in accordance with the principles
of the present invention.

Detailed Description of The Invention 

FIG. 1 is an illustrative schematic diagram that de- 
picts substantially all of the processes that are typically 
active when a single user (who shall be referred to as
user X), on the local host, has a single job running on a
remote host. FIG. 1 is divided into halves 100 and 101,
where half 100 depicts processes 102-105 running on
the local host 100, and half 101 depicts processes
106-110 running on a remote host 101. Thus, user X on
local host 100 (i.e., the left half of FIG. 1) has a single
job 109 running on a remote host 101 (i.e., the right half
of FIG. 1).

Hosts 100 and 101 are connected together through 
a network and share a file system. While FIG. 1 illus- 
trates a network having only two hosts, it should be un- 
derstood that the techniques of the present invention will 
typically be used in computing environments where 
there are multiple hosts, all being networked together, 
either directly together or in groups of networks (e.g., 
multiple local area networks, or LANs, or through a wide 
area network, or WAN), and sharing a file system. 

The process by which a user, such as user X, sub- 
mits a job for execution on a remote host is as follows, 
as illustrated by reference to FIG. 1. The following ex- 
ample assumes a point in time at which none of the proc- 
esses described in FIG. 1 have been started. In this il- 
lustrative example, the principles of the present inven- 
tion are applied to the Coshell system that was de- 
scribed and incorporated by reference above. 

Initially, a user starts a coshell process (a coshell
process is a process that automatically executes shell
actions on lightly loaded hosts in a local network) as a
daemon process (a process that, once started by a user
on a workstation, does not terminate unless the user terminates
it) on the user's local host. Thus, in FIG. 1, user
X starts coshell process 104 as a daemon process on
local host 100. Every coshell daemon process serves
only the user who has started it. For example, while another
user Y may log onto host 100, only user X can
communicate with coshell process 104.

Upon being started, a coshell daemon process 
starts a status daemon (unless a status daemon is al- 
ready running), on its own local host, and on every other 
remote host to which the coshell may submit a job over 
the network. In the case of FIG. 1, coshell 104 has started
status daemon 105 running on host 100 and status
daemon 110 running on host 101.

Each status daemon collects certain current information
regarding the host it is running on (i.e., status
daemon 105 collects information about host 100, while
status daemon 110 collects information about host 101)
and posts this information to a unique status file which
is visible to all the other networked hosts. This current
information comprises: (i) the one minute load average
of the host as determined by the Unix command uptime,
and (ii) the time elapsed since the last activity on an input
device, either directly or through remote login, by a primary
user of the host. All of the status files of all the
status daemons are kept in a central directory. A central
directory is used to reduce the network traffic that would
otherwise occur if, for example, every status daemon
had to post its current information to every other host on
the network. In particular, a central directory of status
files causes the network traffic to scale linearly with the
number of hosts added. Furthermore, each status daemon
posts its current information every 40 seconds plus
a random fraction of 10 seconds; the random fraction
is intended to further relieve network congestion by temporally
distributing the posting of current information. A
coshell uses these status files to select an appropriate
remote host upon which to run a job submitted to it by
the coshell's user.
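
The posting schedule and the per-host status file can be sketched as follows. This is a minimal illustration rather than the status daemon's actual code: the function names, the one-file-per-host naming, and the JSON layout are all assumptions made for the example. Only the 40 second base period, the random fraction of 10 seconds, and the two posted quantities come from the description above.

```python
import json
import os
import random

POST_BASE_SECONDS = 40.0    # from the text: post every 40 seconds ...
POST_JITTER_SECONDS = 10.0  # ... plus a random fraction of 10 seconds


def next_post_delay(rng=random):
    """Seconds to wait before the next post: 40 plus a random fraction of 10."""
    return POST_BASE_SECONDS + rng.random() * POST_JITTER_SECONDS


def post_status(status_dir, host, load_avg_1min, idle_seconds):
    """Write one host's status file into the shared central directory.

    The file name (one file per host) and the JSON layout are assumptions.
    """
    record = {"load": load_avg_1min, "idle_seconds": idle_seconds}
    path = os.path.join(status_dir, host)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(record, handle)
    # Rename into place so a reader does not see a half-written file.
    os.replace(tmp_path, path)
    return path


def read_status(status_dir, host):
    """Read back the status record that a coshell would consult."""
    with open(os.path.join(status_dir, host)) as handle:
        return json.load(handle)
```

Because each daemon writes only its own file, adding a host adds one writer and one file, which is the linear scaling the paragraph describes.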

Each coshell selects a remote host independently
of all of the other coshells that may be running and is
therefore said to implement a form of decentralized
scheduling. The coshells avoid the possibility of all submitting
their jobs to the same remote host by having a
random smoothing component in their remote host selection
algorithm. Rather than choosing a single best remote
host, each coshell chooses a set of candidate remote
hosts and then picks an individual host from within
that set randomly.
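
A minimal sketch of random smoothing, under stated assumptions: the hosts are assumed to arrive already filtered and ranked best first, and the size of the candidate set (`candidate_count`) is a hypothetical parameter, since the text does not say how large the set is.

```python
import random


def pick_remote_host(ranked_hosts, candidate_count=3, rng=random):
    """Pick one host at random from the best few, not always the single best.

    ranked_hosts: host names, best first, already filtered to usable hosts.
    candidate_count: size of the candidate set (an assumed parameter).
    Returns None when no host qualifies, so the caller can hold the job.
    """
    candidates = ranked_hosts[:candidate_count]
    if not candidates:
        return None
    return rng.choice(candidates)
```

Two coshells that see the same ranking will then often pick different hosts, which is the point of drawing from a set rather than always taking the top entry.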

A coshell chooses a set of candidate remote hosts
based upon the current information in the status files
and the static information in an attributes file. There is
a single attributes file shared by all the coshells. For
each host capable of being shared by the present invention,
the attributes file typically stores the following attributes:
type, rating, idle, mem and puser. An illustrative
example of an attributes file is shown in the table of FIG.
2. Each line of FIG. 2 contains, from left to right, the host
name followed by assignments of values to each of the
attributes for that host.

The type attribute differentiates between host
types. For example, the "sgi.mips" in line 1 of FIG. 2 indicates
a host of the Silicon Graphics workstation type,
while "sun4" indicates a Sun Microsystems workstation
type. The rating attribute specifies the speed for a host
in MIPS (millions of instructions per second). All hosts
have preferably had their MIPS rating evaluated according
to a common benchmark program. This ensures the
validity of MIPS comparisons between hosts. The idle
attribute specifies the minimum time which must elapse
since the last primary user activity on a host (as posted
in the host's status file) before the present invention can
use the host as a remote host. The mem attribute describes
the size (in megabytes) of the main memory of
the host, while the puser attribute describes the primary
users of the host. For example, host "banana" of FIG. 2
has two primary users indicated by the quoted list "emerald
ymwang," while host "orange" only has the primary
user "ruby" (no quotes are necessary when there is only
one primary user).
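
Since FIG. 2 is not reproduced here, the following parser is only a sketch of how one line of such a file might be read. The `name=value` assignment syntax is an assumption, and the rating, idle and mem values in the usage example are invented sample values; only the attribute names, the host names and the quoting rule for multiple primary users come from the description.

```python
import shlex


def parse_attributes_line(line):
    """Parse 'host name=value ...' into (host, attributes).

    The name=value syntax is an assumption; FIG. 2 is not reproduced here.
    shlex keeps a quoted list such as puser="a b" together as one token.
    """
    tokens = shlex.split(line)
    host, assignments = tokens[0], tokens[1:]
    attrs = dict(token.split("=", 1) for token in assignments)
    # puser may name several primary users separated by spaces.
    attrs["puser"] = attrs.get("puser", "").split()
    return host, attrs
```

For example, `parse_attributes_line('banana type=sun4 rating=20 idle=15m mem=64 puser="emerald ymwang"')` yields the host name `banana` and a `puser` list with two entries.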

In general, the primary users attribute of a host is 
simply a subset of the universe of potential users of the 
host, selected as primary because they should be ac- 
corded extra priority in having access to the host's re- 
sources. 

Typically, if the host is physically located in a partic- 
ular person's office, then the primary users of that host 
would include the occupant of that office as well as any 
administrative assistants to the occupant. For certain 
computer installations, it is possible to ascertain the oc- 
cupant of the office containing the host from the name 
of the home directory on the host's physically attached 
disk. The names of these home directories can be col- 
lected automatically, over the network, thereby making 
it easier to create and maintain an attributes file.

In addition to, or as an alternative to, the occupant 
of an office containing the host, the primary users of a 
host may include anyone who logs into the host from the 
host's console (where the console comprises the key- 
board and video screen which are physically attached 
to the host). 

Included among any other qualifications required of 
a remote host for it to be included in the set of candidate 
remote hosts chosen by a coshell, is the requirement 
that the remote host be idle. A remote host is considered
idle if all of the following three conditions are satisfied: 
(i) the one minute load average (as posted in the host's 
status file) is less than a threshold (with the threshold 
typically being approximately 0.5); (ii) the time elapsed 
since the last primary user activity (as posted in the 
host's status file) is at least the period of time specified
for the host by the idle attribute in the attributes file 
(where the period of time is typically approximately 15 
minutes); and (iii) no nontrivial jobs are being run by a 
primary user of the remote host. 
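
The three-part idleness test can be sketched as a single predicate. The function and parameter names are illustrative assumptions. The sketch reads the idle attribute as the minimum time that must have elapsed since the last primary user activity, which is how that attribute is defined earlier in the description; the defaults use the typical values given in the text (0.5 and 15 minutes).

```python
LOAD_THRESHOLD = 0.5               # typical load threshold given in the text
DEFAULT_IDLE_SECONDS = 15 * 60     # typical idle attribute: about 15 minutes


def is_idle(load_avg_1min, seconds_since_primary_activity,
            idle_attribute_seconds=DEFAULT_IDLE_SECONDS,
            primary_user_nontrivial_jobs=0):
    """True only if all three conditions (i), (ii) and (iii) hold."""
    return (load_avg_1min < LOAD_THRESHOLD
            and seconds_since_primary_activity >= idle_attribute_seconds
            and primary_user_nontrivial_jobs == 0)
```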

A nontrivial job is typically determined as follows.
The Unix command "ps" (which means process status)
is executed periodically, with the two executions
of the ps command that are compared being separated
from each other by about
one minute. An execution of the ps command returns,
for each process on the host, a flag indicating whether
or not the process was running as of the time the ps
command was executed. If for both executions of ps the
flags for a process indicate that the process was running,
then the process is nontrivial. If a process is indicated
as having stopped, for at least one of the two ps
executions, then the process is considered trivial.
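
The two-sample rule reduces to an intersection, as in the sketch below. It assumes the ps output has already been parsed into a mapping from process ID to a running flag; that parsing step and the function name are assumptions, not part of the description.

```python
def nontrivial_pids(first_sample, second_sample):
    """Return the PIDs flagged as running in both ps samples.

    Each sample maps pid -> True if ps reported the process as running.
    A process missing from the second sample is treated as not running.
    """
    return {pid for pid, running in first_sample.items()
            if running and second_sample.get(pid, False)}
```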

Once a user has a coshell daemon running, the user 
can start a particular job running on a remote host by 
starting a submit program. The submit program takes 
as arguments the job's name as well as any arguments 
which the job itself requires. The job name can identify
a system command, application program object file or a
shell script. The job name can also be an alias to any of
the items listed in the previous sentence. In FIG. 1, user
X has started submit process 102 running with the name
for job 109 as an argument. A submit process keeps running
until the job passed to it has finished executing on
a remote host. Therefore, submit process 102 keeps
running until job 109 is finished.

If, for example, user X wants to run another job remotely,
user X will have to start another submit process.
Each such submit process keeps running until the job 
passed to it as an argument has finished executing on 
its remote host. Thus, there may be several submit proc- 
esses running at the same time for the same user on a 
given local host (e.g., on local host 100). 

The first action a submit process takes upon being
started is to spawn a coshell client process. For example,
submit process 102 has started client process 103
through spawning action 113. As used throughout the
present application, spawning refers to the general
process by which a parent process (e.g., submit process
102) creates a duplicate child process (e.g., client process
103). In Unix, the spawning action is accomplished
by the fork system call. A client process provides an output
on the local host for the standard error and standard
output of the job executing on a remote host. A client
process also has two way communication with its
coshell via command and status pipes. Client process
103 receives the standard error and the standard output
of job 109. In addition, client process 103 communicates
with its coshell 104 via command pipe 114 and status
pipe 115.

Once the client process has been started, its coshell 
then proceeds to select a remote host upon which to run 
the submitted job. As discussed above, each coshell se- 
lects a remote host independently of all the other 
coshells and utilizes random smoothing to prevent the 
overloading of any one remote host. If there are no idle
remote hosts for a coshell to execute a submitted job
upon, the coshell will queue the job until an idle remote
host becomes available. Coshell 104, in the example
shown in FIG. 1, has selected idle remote host 101 to
run the submitted job.

Having selected a remote host, the coshell next 
starts a shell on the remote host - unless the coshell 
already has a shell running on that remote host left over 
from a previous job submission by that coshell to that 
same remote host. Assuming that the coshell does not 
already have a shell running on the selected remote 
host, it will start one with the Unix command rsh. A 
coshell has two way communication with the shell it has
created via a command pipe and a status pipe. In FIG. 
1, coshell 104 has started a shell ksh 106 on remote 
host 101. Coshell 104 and ksh 106 communicate via 
command pipe 116 and status pipe 120. 

It should be noted that the type of shell started on 
a remote host by a coshell, referred to as a ksh, differs 
from an ordinary Unix shell only in its ability to send its 
standard output and standard error (produced by a remotely
submitted job) over the network and back to its
coshell. The coshell then routes this standard output
and standard error to the correct client process. In the
case of FIG. 1, the standard output and standard error
of job 109, which is running in ksh 106, are sent over the
network to coshell 104. Coshell 104 then routes this
standard output and standard error to client process
103.

Once a ksh is running on the selected remote host,
a program called LDSTUB is started under the ksh. An
LDSTUB process is started under the ksh for each job,
from the ksh's coshell, which is to be executed on the
selected remote host. In the case of FIG. 1, LDSTUB
process 107 is started under ksh 106. Upon being started,
the LDSTUB process begins the execution of the
user's job and a monitor daemon by spawning both of
them off. Thus, LDSTUB process 107 has started monitor
daemon 108 through spawn 119 and has started user
X's job 109 through spawn 118.

A monitor daemon is a fine-grain polling process. It
checks for the following conditions on the selected remote
host: (i) any activity on an input device, either directly
or through remote login, by a primary user of the
selected remote host; (ii) any nontrivial jobs being run
by a primary user of the selected remote host; or (iii) a
one minute load average on the selected remote host
(as determined by the monitor daemon itself running the
Unix command uptime) which is equal to or greater than
a particular threshold (typically the threshold is about
3.5). A monitor daemon is a fine-grain polling process
because it frequently checks for the above three conditions,
typically checking every 10 seconds. If one or
more of the above three conditions is satisfied, the monitor
daemon sends a signal to its LDSTUB process. This
signal to its LDSTUB process starts a process, described
below, which migrates the user's job to another
remote host. In the example shown in FIG. 1, monitor
daemon 108 checks for each of the above three conditions
on remote host 101.
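
The monitor daemon's check can be sketched as a predicate plus a polling loop. The names `should_migrate`, `monitor` and `sample` are hypothetical, and how the three inputs are actually measured on the host is left out. The 3.5 threshold and the 10 second interval are the typical values stated above.

```python
import time

MIGRATE_LOAD_THRESHOLD = 3.5   # typical threshold given in the text
POLL_INTERVAL_SECONDS = 10     # typical polling interval given in the text


def should_migrate(primary_user_input_seen, primary_user_nontrivial_jobs,
                   load_avg_1min):
    """True if one or more of conditions (i), (ii) or (iii) is satisfied."""
    return bool(primary_user_input_seen
                or primary_user_nontrivial_jobs > 0
                or load_avg_1min >= MIGRATE_LOAD_THRESHOLD)


def monitor(sample, sleep=time.sleep, interval=POLL_INTERVAL_SECONDS):
    """Poll until a condition holds; return the number of polls taken.

    sample is a caller-supplied function returning the three inputs of
    should_migrate as a tuple. Returning stands in for notifying LDSTUB.
    """
    polls = 1
    while not should_migrate(*sample()):
        sleep(interval)
        polls += 1
    return polls
```

Passing `sleep` as a parameter keeps the loop easy to exercise without waiting for real time to pass.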

The user's job is usually run on the remote host having
been already linked with the Libckp checkpointing
library. Libckp is a user-transparent checkpointing library
for Unix applications. Libckp can be linked with a
user's program to periodically save the program state
on stable storage without requiring any modification to
the user's source code. The checkpointed program state
includes the following in order to provide a truly transparent
and consistent migration: (i) the program counter;
(ii) the stack pointer; (iii) the program stack; (iv) open
file descriptors; (v) global or static variables; (vi) the dynamically
allocated memory of the program and of the
libraries linked with the program; and (vii) all persistent
state (i.e., user files). User X's job 109, in this example,
has been linked with Libckp library 111.

Each of the processes shown in FIG. 1, namely 
processes 102-110, are active at this point in the job
submission process. 

Up to this point in the description of the job submis- 
sion process, the focus has been on presenting the 
process by which a single job is submitted for remote 
execution by a single user. It should be noted, however, 
that a user of a local host may start a second job running
on a remote host before the first submitted job has finished
executing. This second job submission is coordinated
with the processes already running for the first job
submission as follows.

First, the user starts the submit program a second
time with the second job name as an argument (in the
same manner as described above for the first job). This
second starting of the submit program creates a second
submit process just for managing the remote execution
of the second job. The second submit process spawns
a second coshell client process (similar to client process
103) which is also just for the remote execution of
the second job. The second coshell client process, however,
communicates with the same coshell which the
first coshell client process communicates with.

The coshell then selects a remote host for
the second job to execute upon and starts a ksh on that
remote host - unless the coshell already has a ksh running
on that remote host. For example, if the coshell
chooses the same remote host upon which the first job
is executing, then the coshell will use the same ksh in
which the first job is executing. Regardless of whether
the coshell needs to create a new ksh or not, within the
selected ksh a second LDSTUB process, just for the remote
execution of the second job, is started. The second
LDSTUB process spawns the second job and a second
monitor daemon.

In general then, it can be seen that if a user has N
jobs remotely executing which have all been submitted
from the same local host, then the user will have a
unique set of submit, client, LDSTUB, monitor daemon
and job processes created for each of the N jobs submitted.
All of the N jobs will share the same coshell daemon.
Any of the N jobs executing on the same remote
host will share the same ksh.
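
The bookkeeping in this paragraph can be worked through with a small tally. The function below is purely illustrative (its name and the dictionary layout are assumptions); it counts only the processes named in the paragraph and leaves out the status daemons, which exist per host rather than per job.

```python
def process_counts(job_hosts):
    """Tally one user's processes, given the remote host of each of N jobs.

    Per job: one submit, client, LDSTUB, monitor daemon and job process.
    Shared: one coshell daemon, and one ksh per distinct remote host in use.
    """
    n = len(job_hosts)
    return {
        "submit": n,
        "client": n,
        "LDSTUB": n,
        "monitor daemon": n,
        "job": n,
        "coshell": 1,
        "ksh": len(set(job_hosts)),
    }
```

With three jobs, two of them on the same remote host, this gives three of each per-job process, one coshell, and two ksh processes.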

The remainder of this description addresses the operation
of the present invention with respect to a single
remotely executing job for one user of a local host. For
any additional remotely executing jobs submitted by the
same user from the same local host, each such additional
job will have its own unique set of independent
processes which will respond to conditions in the same
manner as described below.

At the point in the job submission process when a
user's job is remotely executing, one of two main conditions
will occur: (i) one or more of the three conditions
described above as being checked for by a monitor daemon
is satisfied causing the user's job to be migrated to
another remote host; or (ii) the remotely executing job
will finish. The process initiated, at this point in the job
submission process, by each of these two main conditions
occurring is described, in turn, below.

If the first main condition (which may be one or more
of the three conditions checked for by a monitor daemon
occurring) is satisfied, the following steps will occur in
order to migrate the user's job to another remote host.
Each of these steps will be explicated by reference to
the specific process configuration of FIG. 1.

First, the satisfaction of the first main condition
causes the monitor daemon to exit. The exiting of the
monitor daemon causes a "SIGCHLD" signal to be sent
to its parent LDSTUB process. In FIG. 1, the exiting of
monitor daemon 108, upon the satisfaction of the first
main condition, causes a "SIGCHLD" signal to be sent
to monitor daemon 108's parent process LDSTUB 107.

Second, the LDSTUB process kills the user's job. 
In FIG. 1, LDSTUB process 107 kills user X's job 109.
If the user's job has been linked with Libckp, as is the
case for job 109, a subsequent restart of the job will only
mean that work since the last checkpoint is lost. Typically,
Libckp performs a checkpoint every 30 minutes so
that only the last 30 minutes of job execution, at most,
will be lost.

The third step is as follows. There is a socket connection
between an LDSTUB process and its submit
process. Over this socket connection the LDSTUB process
sends a message to its submit process telling it to
have the user's job run on another remote host. LDSTUB
process 107 sends such a message over socket
112 to submit process 102.

In the fourth step, the submit process kills the client 
process it had originally spawned off to communicate
with the remote job through coshell. The submit process
also kills its LDSTUB process. In the case of FIG. 1, submit
process 102 kills client process 103 and LDSTUB
process 107. 

In the fifth step, the submit process resubmits the 
user's job to another remote host according to the sub- 
mission process described above. Submit process 102 
resubmits job 109 to coshell 104 for execution on an- 
other remote host. The subsequent processes and the 
other remote host which would be a part of job 109's
resubmission are not shown in FIG. 1. When the user's
job is executed on another remote host, the first action
of the job is to check for a checkpoint file. If the job finds
a checkpoint file it restores the state of the job as of the
point in time of the last checkpoint.
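
The restart behaviour (look for a checkpoint file first, then continue from the saved state) can be illustrated with a toy job. This is not Libckp, which saves the process image transparently; here the job saves its own two-field state as JSON, and the function name, the file layout and the `stop_after` parameter that simulates the job being killed are all assumptions made for the example.

```python
import json
import os


def run_job(checkpoint_path, total_steps, checkpoint_every, stop_after=None):
    """Sum 0..total_steps-1, checkpointing as it goes; resume if possible.

    stop_after simulates the job being killed after that many steps of
    this run; in that case None is returned and no result is produced.
    """
    state = {"next_step": 0, "total": 0}
    # First action of the job: check for a checkpoint file and restore it.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)
    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run == stop_after:
            return None  # killed: work since the last checkpoint is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            with open(checkpoint_path, "w") as handle:
                json.dump(state, handle)
    return state["total"]
```

Running the job once with `stop_after` set and then again without it produces the same final sum as an uninterrupted run, while the second run repeats only the steps taken after the last saved checkpoint.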

It is important to note that the ksh under which the 
user's job had been running, along with its pipe connec- 
tions to its coshell, is kept running. This ksh may be used 
again later by its coshell provided that the remote host 
upon which the ksh is running becomes idle again by 
satisfying the three conditions described above for a 
candidate remote host. In the case of FIG. 1, ksh 106 is
kept running along with its command pipe 116 and sta- 
tus pipe 120 connections to coshell 104. 

Alternatively, if the second main condition de- 
scribed above (of the remotely executing job finishing) 
is satisfied, the following steps will occur. As with the 
first main condition, each of these steps will be explicated
by reference to the specific process configuration of
FIG. 1. 

First, when the user's job finishes execution on a 
remote host, it exits causing a "SIGCHLD" signal to be 
sent to its parent process (its LDSTUB process). In response
to receiving the "SIGCHLD" signal, LDSTUB obtains
the exit status number (a status number is returned
by every exiting Unix process to indicate its completion
status) of the user's job with the Unix system call wait(2).
When user X's job 109 finishes execution on remote
host 101 it exits causing a "SIGCHLD" signal to be sent
to its parent LDSTUB process 107. LDSTUB process
107 then obtains job 109's exit status number with the
wait(2) system call.
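
The parent and child relationship described here can be sketched on a Unix system with Python's wrappers around fork and waitpid. This is an illustration only, not LDSTUB's code: the function name is hypothetical, the child merely exits with a chosen status in place of running a real job, and the parent blocks in `os.waitpid` rather than installing a SIGCHLD handler.

```python
import os


def run_child_and_wait(exit_code):
    """Fork a child that exits with exit_code; return the collected status.

    Unix only. The child stands in for the user's job finishing; the
    parent stands in for LDSTUB collecting the exit status number.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)           # child: exit immediately with the status
    _, status = os.waitpid(pid, 0)    # parent: wait for that child
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None                       # child was ended by a signal instead
```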

Second, the LDSTUB process kills its monitor daemon.
LDSTUB process 107 kills monitor daemon 108.

The third step is as follows. There is a socket connection
between the LDSTUB process and its submit
process. Over this socket connection the LDSTUB process
sends a message to its submit process telling it that
the user's job has finished and containing the status
number returned by the user's job. LDSTUB process
107 sends such a message over socket 112 to submit
process 102.

In the fourth step, the submit process kills the client
process it had originally spawned off to communicate
with the remote job through coshell. The coshell then
detects that a particular client process has been killed
by a particular signal number. In response to detecting
this the coshell sends a message to the appropriate ksh.
The message sent by the coshell instructs the ksh to kill
the LDSTUB process, corresponding to the submit process,
using the same signal number by which the submit
process killed its client process. In the case of FIG. 1,
submit process 102 kills client process 103. Coshell 104
then detects that client process 103 has been killed by
a particular signal number. In response to detecting this,
coshell 104 sends a message to ksh 106. The message
sent by coshell 104 instructs ksh 106 to kill LDSTUB
process 107 with the same signal number by which submit
process 102 killed client process 103.

In the fifth step, the submit process exits with the 
same status number returned by the remotely executed 
job. Submit process 102 exits with the status number 
returned by the execution of job 109 on remote host 101.

As with the migration process described above, it is
important to note that the ksh under which the user's job
had been running, along with its pipe connections to its
coshell, is kept running. This ksh may be used again
later by its coshell provided that the remote host upon
which the ksh is running becomes idle again by satisfying
the three conditions described above for a candidate
remote host. In the case of FIG. 1, ksh 106 is kept running
along with its command pipe 116 and status pipe
120 connections to coshell 104.

The workstations shared through the present invention
can be all of one type (homogeneous workstations).
An example is a network where only Sun Microsystems
workstations can be shared. Alternatively, the present
invention can be used to share a variety of workstation
types (heterogeneous workstations). An example is a
network where both Sun Microsystems workstations
and Silicon Graphics workstations can be shared.

Where the present invention is used with heterogeneous
workstations, the migration of a job, based upon
workstation type, is typically as follows. Where no type
is specified upon submitting a job, the job is limited to
migrating among workstations of the same type as the
workstation from which the job was submitted. The user
can specify, however, upon submitting a job, that the job
only be executed upon workstations of one particular type,
and this type need not be the same as the type of
workstation from which the job was submitted. The user can
also specify, upon submitting a job, that the job can be
executed on any workstation type.
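
The three type rules above may be illustrated by the following Python
sketch, offered only as an example under stated assumptions and not as
the described implementation. The names Host, ANY_TYPE and
eligible_hosts, and the type labels "sun" and "sgi", are hypothetical.
The sketch filters hosts by type only; the idleness conditions for a
candidate remote host are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY_TYPE = "any"  # hypothetical marker meaning "any workstation type"


@dataclass(frozen=True)
class Host:
    name: str
    wtype: str  # workstation type label, e.g. "sun" or "sgi" (assumed)


def eligible_hosts(
    hosts: List[Host],
    submit_type: str,
    requested_type: Optional[str] = None,
) -> List[Host]:
    """Return the hosts a job may execute on or migrate to.

    - no type requested: only hosts of the submitting workstation's type
    - a specific type requested: only hosts of that type
    - ANY_TYPE requested: every host
    """
    if requested_type == ANY_TYPE:
        return list(hosts)
    wanted = submit_type if requested_type is None else requested_type
    return [h for h in hosts if h.wtype == wanted]
```

For instance, a job submitted from a "sun" workstation with no type
specified is restricted to "sun" hosts, whereas the same job submitted
with requested_type="sgi" is restricted to "sgi" hosts.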

Persons skilled in the art will appreciate that the
present invention may be practiced by other than the
described embodiments, which are presented for purposes
of illustration and not of limitation, and the present
invention is limited only by the claims which follow.



Claims 

1. A system for sharing computer resources among a
plurality of computers connected together by a network,
comprising:

a local computer that accepts jobs for remote
execution, the host computer being one of the
plurality of computers;

a remote computer that is capable of receiving
jobs for remote execution, the remote computer
being one of the plurality of computers and
executing a current status process to collect current
information that includes activity information
about primary users of the remote computer; and

a scheduling computer that chooses a location
for remote execution of the accepted jobs
based upon at least the primary user current
information regarding the remote computer, the
scheduling computer being one of the plurality
of computers.

2. The system of claim 1, wherein the scheduling
computer and the local computer are the same computer.

3. The system of claim 1, wherein the scheduling
computer and the remote computer are the same
computer.

4. The system of claim 1, wherein the local computer,
the remote computer and the scheduling computer
are three different computers of the plurality of
computers.

5. The system of claim 1, wherein the scheduling
computer capability of choosing a location for remote
execution is executed as a decentralized scheduling
process among the plurality of computers.



6. The system of claim 1, wherein the local computer
executes a submission process in response to a job
being submitted for remote execution by a user of
the local computer.

7. The system of claim 1, wherein the remote computer
also executes:

a monitor process that monitors the activity of
primary users of the remote computer.

8. The system of claim 1, wherein the activity
information includes time elapsed since a last activity on an
input device by a primary user of the remote computer.

9. The system of claim 8, wherein the input device is
directly connected to the remote computer.

10. The system of claim 8, wherein the input device is
connected to the remote computer through a remote
login process.

11. The system of claim 1, wherein the scheduling
computer chooses a remote computer as the computer
to remotely execute a job only if the time elapsed
since a last activity on the input device by a primary
user of the remote computer is greater than a
predetermined threshold.

12. The system of claim 1, further comprising a file
system shared by at least the local and remote
computers.

13. The system of claim 12, wherein a primary user of
the remote computer is determined by attribute
information stored in a central location in the file system.

14. The system of claim 13, wherein the scheduling
computer chooses a location based on current
information kept in the file system and the current status
process on each remote computer posts the current
information it collects to the central location in
the file system.

15. A system for sharing computer resources among a
plurality of computers connected together by a network,
comprising:

a local computer that accepts jobs that a user
has submitted to the local computer for execution
on a remote computer; and

a remote computer, selected by a scheduling
process, that executes the job submitted by the
user, and executes a monitor process that
causes the execution of the job to be suspended
if the monitor process detects activity on the
remote computer by a primary user of the remote
computer.

16. The system of claim 15, wherein the remote com- 
puter suspends execution of the job if activity by a 
primary user of the remote computer is detected on 
an input device of the remote computer. 

17. The system of claim 15, wherein the local computer
executes a submission process such that if the job
is suspended by the remote computer, the job is
resubmitted by the submission process to the scheduling
process for execution on a second remote
computer.

18. The system of claim 15, wherein the monitor proc- 
ess is spawned from a parent process. 

19. The system of claim 15, wherein the monitor proc- 
ess is a fine-grain polling process. 

20. The system of claim 15, wherein the job is linked
with a checkpointing library.

21. The system of claim 15, further comprising a file
system shared by at least the local and remote
computers.

22. The system of claim 21, wherein a primary user of
the remote computer is determined by attribute
information stored in a central location in the file system.

23. The system of claim 15, wherein the scheduling 
process is a decentralized scheduling process. 

24. The system of claim 23, wherein the decentralized
scheduling process is executed on the local computer
that submits a job for remote execution.

25. The system of claim 15, wherein the scheduling 
process is executed by a single processor for the 
entire network. 

26. A method of sharing computer resources among a
plurality of computers connected together by a network
comprising the steps of:

accepting jobs for remote execution on a first
computer;

executing a scheduling process to schedule the
remote execution of the accepted jobs based
on current information about each of the plurality
of computers that may be remotely accessed,
the current information including activity
information about primary users of each of
the remotely accessible computers; and

running a current status process on at least one
second computer of the plurality of computers,
the second computer being one of the remotely
accessible computers, the current status process
operating to collect current information
about the second computer.

27. The method of claim 26, further comprising the step
of:

executing a submission process on the first
computer in response to a job being submitted by a
user of the first computer.

28. The method of claim 26, wherein the step of running
a current status process collects activity information
that includes time elapsed since a last activity on
an input device by a primary user of the second
computer.

29. The method of claim 26, wherein the step of
executing a scheduling process chooses a remote computer
for remote execution of the submitted job only
if the time elapsed since a last activity on an input
device by a primary user of the remote computer is
greater than a predetermined threshold, the remote
computer being one of the second computers.

30. The method of claim 29, further comprising the
steps of:

executing one of the accepted jobs on the remote
computer chosen by the scheduling process; and

monitoring the remote computer for activity by
a primary user of the remote computer.

31. The method of claim 30, further comprising the step
of:

suspending the remote execution of the job
when activity by a primary user is sensed during the
step of monitoring.

32. The method of claim 31, further comprising the step
of:

resubmitting the suspended job to the scheduling
process for execution on a different remote
computer when activity by a primary user is sensed
during the step of monitoring.

33. The method of claim 30, further comprising the step
of:

linking the job to a checkpointing library prior
to executing the job.

34. The method of claim 26, wherein the step of
executing a scheduling process is executed as a
decentralized scheduling process.

35. The method of claim 26, wherein the step of
executing a scheduling process is executed as a
centralized scheduling process on a single processor.